Abstract. The past few decades have witnessed the boom of social networks,
which has led to extensive research on information dissemination. However,
owing to the insufficient information exposed before a cascade breaks out,
many previous works fail to fully capture its characteristics, and thus
usually model the burst process only roughly. In this paper, we employ
survival theory and design a novel survival perspective Early Pattern
detection model for Outbreak Cascades (EPOC for short), which utilizes
information from both the static nature of a cascade and its later diffusion
process. To classify the cascades, we employ two Gaussian distributions to
obtain the optimal boundary and provide a rigorous proof of its rationality.
Then, by utilizing both the survival boundary and the hazard ceiling, we can
precisely detect the early pattern of outbreak cascades at a very early stage.
Experiment results demonstrate that, under three practical and special
metrics, our model outperforms the state-of-the-art baselines in this
early-stage task.

Keywords: Early-stage Detection • Outbreak Cascade • Survival Theory •
Cox’s Model • Social Networks


1  Introduction 

The rapid development of modern technology has changed lifestyles to a
large extent compared to a few years ago. Every day millions of people express
ideas and interact with friends through online platforms like Twitter and Weibo.
On these platforms, registered users are able to tweet short messages (e.g., up
to 140 characters on Twitter), and others who are interested in them will give
likes, comments, or more commonly, retweets. Such retweeting can disseminate
information to a large number of users, which forms a cascade [1]. As the
cascade grows larger and gets more individuals involved, a sudden burst will
arrive, which we call a spike. As a matter of fact, detecting and predicting
the burst pattern of a cascade, especially at an early stage, attracts a lot
of attention in various domains: meme tracking [2], stock bubble diagnosis [3],
sales prediction [4], etc.
Fig. 1. Samples of Cascade Diffusion on Twitter: (a) Cascade Life Cycle; (b) Retweeting @Cascade2; (c) Survival Curve @Cascade2


However, fully understanding the burst pattern of cascades ahead of time
poses three major challenges. First and foremost, due to the deficiency of
available information and its disordered nature at the early stage [5], one can
hardly catch distinguishing signs of whether a cascade will break out. The
second challenge stems from the significantly distinct life spans of different
cascades [6], which make it tough to extract typical features and, worse still,
hard for researchers to set a suitable observation time. The third challenge is
that the burst pattern of cascades usually follows a quick rise-and-fall law
[7]: a burst lasts only a few minutes but has a significant influence. In this
situation, the correlations between the history and the near future can hardly
be characterized by traditional models.

As shown in Figure 1(a), we plot the diffusion process of seven real-world
cascades from Twitter. We can see that @Cascade2 shares almost the same pattern
with @Cascade1 before it breaks out at time $t_0$, which means that it is hard
for us to catch the distinguishing signs using the early information. As the
second challenge states, @Cascade1~7 exhibit different life spans at the early
stage. While @Cascade6 ends its diffusion, @Cascade3 is just about to start
propagating, and it is still growing even at the end of the observation. The
third challenge is illustrated in Figure 1(b), where we focus on @Cascade2 and
plot how it is retweeted. Figure 1(b) shows that @Cascade2 experiences a mild
propagation when it appears, but after time $t_0$, it goes through two large
retweeting spikes (sudden falls in the survival curve plotted in Figure 1(c)),
and the final number of retweets explodes to about 1600 during the burst
period. These three core challenges motivate us to design a model that can
handle this quick rise-and-fall pattern, characterize different cascades
uniformly, and detect the burst pattern as early as possible.

Motivated by the study of death in biological organisms, in this paper we
regard the diffusion of cascades as the growing process of biological organisms.
Since Cox’s model is widely used to characterize the life span of biological
organisms, we adopt Cox’s model and combine it with the knowledge of cascades,
transforming the burst detection task into a diagnosis of the cascade life
table, and then build a survival perspective Early Pattern detection model for
Outbreak Cascades, EPOC for short. Though previous work [8] has also tried
Cox’s model, it is mainly based on unsubstantiated observations and takes only
one feature into consideration, which does not address the above challenges.

In EPOC, to consider the influential factors from different perspectives,
we harness three features from each cascade (retweet sequence, follower number
sequence, and original timestamp) to capture the effectiveness of temporal
information [9], the influence of involved users [10], and the dynamics of user
activity [11]. Then, to study the distinctiveness of cascades’ life spans, we
train an effective Cox’s model and employ two Gaussian distributions to fit the
survival probabilities of viral and non-viral cascades at each time point,
obtaining a survival boundary between the viral and the non-viral, which is
further proven to be theoretically well-defined. Finally, as the static and
dynamic natures of cascade diffusion are both important indicators of cascade
virality, we jointly consider the survival probability and the hazard rate,
which considerably enhances our model’s performance in handling the quick
rise-and-fall pattern. We then employ three special metrics (k-coverage, Cost,
Time ahead) to compare EPOC with two basic machine learning methods (LR, SVR)
and three powerful baselines published in recent literature (PreWhether [12],
SEISMIC [10], SansNet [8]) on two large real-world data sets: Twitter and
Weibo. Experiment results show that EPOC outperforms these five methods in
burst pattern detection at a very early stage.

Our  main  contributions  are  summarized  as: 

—  We  adopt  survival  theory  and  establish  a  powerful  burst  detection  model 
EPOC  for  cascade  diffusion,  which  can  handle  the  quick  rise-and-fall  pattern 
as  well  as  the  significantly  distinct  life  span  of  cascades  at  the  early  stage. 

—  We  utilize  both  static  and  dynamic  information  from  cascades,  obtain  a 
two-class boundary with two Gaussian distributions, and then use the burst
pattern in a novel way to help predict the popularity of online content.

—  We  adopt  three  special  metrics  and  conduct  extensive  experiments  on  two 
large  real-world  data  sets  (Twitter  and  Weibo).  The  results  show  that  EPOC 
achieves the best performance compared with five state-of-the-art approaches.

The remainder of the paper is organized as follows. Some common notions
of survival theory and the basic Cox’s model are introduced in Section 2. The
design of our proposed model EPOC is specified in Section 3. We evaluate and
analyze our model on Twitter and Weibo in Section 4. We review several related
works in Section 5. Finally, we conclude our work and highlight possible
future perspectives in Section 6.

2  Survival  Analysis  and  Cox’s  Model 

In this section, we give some definitions from survival theory in the context
of social networks. Initially, when a user shares content with her set of
friends, several of these friends share it with their respective sets of
friends, and a cascade of resharing can develop [?]. Once the size of this
cascade grows above a certain threshold $p$, we say that it goes viral, and
otherwise it is non-viral. To quantitatively describe these statuses of cascade
diffusion, we introduce the survival function and the hazard function in
Definition 1 and Definition 2, respectively.

Definition 1. (Survival Function): let $S(t) \in (0, 1)$ denote the survival
probability of a cascade subject to time $t$, i.e., at time $t$, the cascade has
the probability of $S(t)$ to be non-viral, where $S(t)$ is naturally monotonic
decreasing with time $t$.


Definition 2. (Hazard Function): let $h(t) \in (0, \infty)$ denote the hazard
rate of a cascade at time $t$ on the condition that it survives until time $t$,
i.e., $h(t)$ is the ratio of the negative derivative of the survival
probability, $-\frac{\mathrm{d}S(t)}{\mathrm{d}t}$, to the survival function
$S(t)$, specifically given by the following formula,

$$h(t) = -\frac{\mathrm{d}S(t)}{\mathrm{d}t} \cdot \frac{1}{S(t)}. \tag{1}$$
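
To make Definition 2 concrete, the following minimal Python sketch estimates the hazard rate from a survival curve sampled on a time grid. The function name `hazard_from_survival` and the finite-difference approximation of the derivative are illustrative assumptions, not part of the original model.

```python
import numpy as np

def hazard_from_survival(times, survival):
    """Approximate h(t) = -(dS/dt) / S(t) on a grid of time points."""
    times = np.asarray(times, dtype=float)
    survival = np.asarray(survival, dtype=float)
    # Finite-difference estimate of dS/dt; the grid may be non-uniform.
    dS_dt = np.gradient(survival, times)
    return -dS_dt / survival

# Sanity check: S(t) = exp(-0.5 t) has a constant hazard rate of 0.5.
t = np.linspace(0.0, 4.0, 401)
h = hazard_from_survival(t, np.exp(-0.5 * t))
```

A sudden drop in the survival curve shows up as a spike in the returned hazard values, which is the signal exploited later by the hazard ceiling.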


Since Cox’s survival model was proposed [?], it has been widely used
in the analysis of time-to-event data with censoring and covariates [?]. In this
work, we use Cox’s proportional hazard model with time-dependent covariates
(also called the Cox-extended model) to characterize the association between
early information and the cascade status (viral or non-viral).

Basic Model: cascades $i = 1, 2, \cdots, n$ share the same baseline hazard
function, denoted as $h_0(t)$, and
$X_i(t) = \{x_{i1}(t), x_{i2}(t), \cdots, x_{im}(t)\}$ denotes the feature
vector of the $i$th cascade, where $h_0(t)$ does not depend on any $X_i(t)$ but
only on $t$. $\beta = \{\beta_1, \beta_2, \cdots, \beta_m\}$ is the parameter
vector of our hazard model. We specify the hazard function of the $i$th cascade
as follows,

$$h_i(t) = h_0(t) \cdot \exp\left(\beta^T X_i(t)\right). \tag{2}$$


Because the model is proportional, given the $i$th and $j$th cascades, the
relative hazard rate $\lambda_{ij}$ can be concretely given by,

$$\lambda_{ij} = \frac{h_i(t)}{h_j(t)} = \frac{h_0(t) \cdot \exp\left(\beta^T X_i(t)\right)}{h_0(t) \cdot \exp\left(\beta^T X_j(t)\right)} = \frac{\exp\left(\beta^T X_i(t)\right)}{\exp\left(\beta^T X_j(t)\right)}, \tag{3}$$

where $\beta$ is the parameter vector, and $X_i(t)$ and $X_j(t)$ are
respectively the feature vectors of the $i$th and $j$th cascades. From Eqn. 3,
it is easy to conclude that the baseline hazard does not play any role in the
relative hazard rate $\lambda_{ij}$, i.e., the model is a semi-parametric
approach. Therefore, instead of considering the absolute hazard function, we
only care about the relative hazard rate of cascades, which only involves the
parameter vector $\beta$. We then use Maximum Likelihood Estimation to obtain
$\beta$. We denote the time-to-event of the $i$th cascade as $t_i$ and assume
that $0 < t_1 < t_2 < \cdots < t_n$. The Cox’s partial likelihood is given by,


$$L(\beta) = \prod_{i=1}^{n} \left[ \frac{h_i(t_i)}{\sum_{j=i}^{n} h_j(t_i)} \right]^{\delta_i}, \tag{4}$$

where $\delta_i$ indicates whether the data from the $i$th cascade is censored,
i.e., if the event happens to the $i$th cascade, then $\delta_i$ equals 1, and
otherwise 0. Then the log-partial likelihood of the parameter vector $\beta$
can be calculated as,

$$\log L(\beta) = \sum_{i=1}^{n} \delta_i \left[ \beta^T X_i(t_i) - \log \sum_{j=i}^{n} \exp\left(\beta^T X_j(t_i)\right) \right]. \tag{5}$$

Maximizing the log-partial likelihood by solving the equation
$\frac{\partial \log L(\beta)}{\partial \beta} = 0$, we can obtain a numerical
estimate of the parameter vector $\beta$ using Newton's method.
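
As an illustration of how such an estimate might be computed, the sketch below evaluates the negative log-partial likelihood and runs plain Newton iterations with NumPy. For brevity it assumes time-independent covariates and no tied event times (the model in this paper uses time-dependent covariates); the function names and the synthetic data are hypothetical.

```python
import numpy as np

def neg_log_partial_likelihood(beta, X, times, events):
    """Negative Cox log-partial likelihood (time-independent covariates)."""
    order = np.argsort(times)
    X, events = X[order], events[order]
    scores = X @ beta
    # After sorting by time, the risk set of cascade i is {j : j >= i}, so a
    # reversed cumulative sum gives sum_{j >= i} exp(beta^T X_j).
    risk = np.cumsum(np.exp(scores)[::-1])[::-1]
    return -np.sum(events * (scores - np.log(risk)))

def fit_cox_newton(X, times, events, n_iter=25, tol=1e-8):
    """Maximize the log-partial likelihood with plain Newton steps."""
    order = np.argsort(times)
    X, events = X[order], events[order]
    n, m = X.shape
    beta = np.zeros(m)
    for _ in range(n_iter):
        w = np.exp(X @ beta)
        grad = np.zeros(m)
        hess = np.zeros((m, m))
        for i in range(n):
            if events[i] == 0:  # censored cascades contribute no term
                continue
            w_risk, X_risk = w[i:], X[i:]
            mean = (w_risk @ X_risk) / w_risk.sum()
            second = (X_risk * w_risk[:, None]).T @ X_risk / w_risk.sum()
            grad += X[i] - mean
            hess -= second - np.outer(mean, mean)
        step = np.linalg.solve(hess, grad)
        beta = beta - step  # Newton update for a maximization problem
        if np.max(np.abs(step)) < tol:
            break
    return beta

# Synthetic example: hazards proportional to exp(beta_true^T x).
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))
beta_true = np.array([0.8, -0.5])
event_times = rng.exponential(1.0 / np.exp(X @ beta_true))
censor_times = rng.exponential(2.0, size=300)
times = np.minimum(event_times, censor_times)
events = (event_times <= censor_times).astype(float)
beta_hat = fit_cox_newton(X, times, events)
```

Because the baseline hazard cancels in every ratio, the fit never needs $h_0(t)$, which is the semi-parametric property noted above.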


3  EPOC:  detecting  Early  Pattern  of  Outbreak  Cascades 

Based on the basic model stated previously, in this section we combine Cox’s
model with our knowledge of cascades and make it suitable for the task of
detecting the early pattern of outbreak cascades. Here we regard cascades
as complex dynamic objects that pass through successive stages as they grow.
During this process of growth, the survival probability and the hazard rate of
cascades change dynamically. A high survival probability and a low hazard
rate suggest that a cascade is unlikely to go viral in the future, while a low
survival probability and a high hazard rate imply the opposite. In this sense,
we introduce the survival boundary and the hazard ceiling to help accomplish
this challenging task at a very early stage.

Feature Selection: as stated previously, the effectiveness of temporal
information, the influence of involved users, and the dynamics of user activity
are all powerful indicators of the cascade status. Therefore, in this
experiment, we utilize three features accordingly: the timestamp of each
retweet, the number of followers of every user involved in the cascade, and the
timestamp of the first tweet.

3.1  Survival  Boundary:  a  Static  Perspective 


Fig. 2. Survival Functions and Survival Boundary: (a) Survival Functions of Cascades; (b) Survival Boundary
To detect the early pattern of outbreak cascades, we first characterize the
survival functions of all cascades. As shown in Figure 2(a), the red lines
represent the survival functions of viral cascades, and the blue lines those of
non-viral cascades. We then need to divide the estimated survival functions of
all cascades into two classes (viral and non-viral). In other words, we need to
find a survival boundary. As illustrated in Figure 2(b), the red dashed line
separates the two categories of blue (non-viral cascades) and red (viral
cascades).

Previous works [?] have demonstrated that at a fixed observing time $t$, the
survival probabilities of different cascades follow a Gaussian distribution.
Based on this knowledge, we employ two random variables, $f_v^t$ (for viral
cascades) and $f_n^t$ (for non-viral cascades), subject to time $t$, which
follow Gaussian distributions. Formally, we specify this assumption in
Definition 3.

Definition 3. For any given time $t$, we have
$f_v^t \sim N(\mu_v^t, (\sigma_v^t)^2)$ and $f_n^t \sim N(\mu_n^t, (\sigma_n^t)^2)$,
where $\mu_v^t$, $\sigma_v^t$ and $\mu_n^t$, $\sigma_n^t$ are the parameters of
the Gaussian distributions for viral and non-viral cascades subject to time $t$.

Based on Definition 3, for a given time $t$, the survival probabilities of
viral and non-viral cascades can be respectively characterized by $f_v^t$ and
$f_n^t$. Therefore, the task of finding the optimal survival boundary is to find
a suitable separation between the two Gaussian distributions.

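Definition 4. (Survival Boundary): for any given time $t$, assume the survival
boundary to be $S^*(t)$, which is given by the following formula,

$$\int_{-\infty}^{S^*(t)} \frac{1}{\sqrt{2\pi}\,\sigma_v^t} \exp\left(-\frac{(x-\mu_v^t)^2}{2(\sigma_v^t)^2}\right) \mathrm{d}x = \int_{S^*(t)}^{+\infty} \frac{1}{\sqrt{2\pi}\,\sigma_n^t} \exp\left(-\frac{(x-\mu_n^t)^2}{2(\sigma_n^t)^2}\right) \mathrm{d}x.$$

Then the optimal survival boundary can be calculated as

$$S^*(t) = \frac{\mu_n^t \sigma_v^t + \mu_v^t \sigma_n^t}{\sigma_v^t + \sigma_n^t}. \tag{6}$$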
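
For readers who want the intermediate step, the closed form in Eqn. 6 can be derived from the integral condition of Definition 4 as sketched below. The sketch assumes the left-hand integral is taken over the viral density and the right-hand one over the non-viral density; swapping the two densities yields the same boundary. The symbol $\Phi$ (the standard normal cumulative distribution function) is introduced here for illustration only.

```latex
\begin{aligned}
\Phi\!\left(\frac{S^*(t)-\mu_v^t}{\sigma_v^t}\right)
  &= 1-\Phi\!\left(\frac{S^*(t)-\mu_n^t}{\sigma_n^t}\right)
   = \Phi\!\left(\frac{\mu_n^t-S^*(t)}{\sigma_n^t}\right) \\
\Longrightarrow\quad
\frac{S^*(t)-\mu_v^t}{\sigma_v^t}
  &= \frac{\mu_n^t-S^*(t)}{\sigma_n^t}
  \qquad\text{(since $\Phi$ is strictly increasing)} \\
\Longrightarrow\quad
S^*(t)\left(\sigma_v^t+\sigma_n^t\right)
  &= \mu_n^t\sigma_v^t+\mu_v^t\sigma_n^t \\
\Longrightarrow\quad
S^*(t)
  &= \frac{\mu_n^t\sigma_v^t+\mu_v^t\sigma_n^t}{\sigma_v^t+\sigma_n^t}.
\end{aligned}
```

The boundary is thus a weighted average of the two means, lying closer to the mean of the distribution with the smaller standard deviation.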


Fig. 3. Survival Frequency and Survival Boundary at Time t: (a) Survival Frequency and Fitting (histograms and fitting curves of viral and non-viral cascades over survival probability)
As shown in Figure 3(a), given time $t$, we plot the frequency histograms of
the survival probabilities of both viral and non-viral cascades (blue bars
represent non-viral ones, and red bars represent viral ones). Then we use two
Gaussian distribution curves $f_v^t$ and $f_n^t$ to fit these two histograms.
Next, to simplify our problem, we employ the cumulative distribution functions
of $f_v^t$ and $f_n^t$, respectively denoted as $F_v^t(s)$ and $F_n^t(s)$, with
$F_n^t(s)$ taken as the upper-tail probability so that the intersection
coincides with Definition 4. Specifically we have,

$$F_v^t(s) = P(S < s) = \int_{-\infty}^{s} \frac{1}{\sqrt{2\pi}\,\sigma_v^t} \exp\left(-\frac{(x-\mu_v^t)^2}{2(\sigma_v^t)^2}\right) \mathrm{d}x, \tag{7a}$$

$$F_n^t(s) = P(S > s) = \int_{s}^{+\infty} \frac{1}{\sqrt{2\pi}\,\sigma_n^t} \exp\left(-\frac{(x-\mu_n^t)^2}{2(\sigma_n^t)^2}\right) \mathrm{d}x. \tag{7b}$$

Finally, we plot $F_v^t(s)$ and $F_n^t(s)$ in Figure 3(b), and the x-coordinate
of their only intersection, $S^*(t)$, is the optimal survival boundary subject
to time $t$.
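
The boundary computation at a single observation time can be sketched in a few lines of Python. The parameter values below are hypothetical, and the helper names (`fit_gaussian`, `normal_cdf`, `survival_boundary`) are assumptions made for illustration; the closed form follows the weighted-average expression for $S^*(t)$ used in the monotonicity proof.

```python
import math
import statistics

def fit_gaussian(samples):
    """Return (mean, standard deviation) of observed survival probabilities."""
    return statistics.fmean(samples), statistics.pstdev(samples)

def normal_cdf(x, mu, sigma):
    """Cumulative distribution function of N(mu, sigma^2)."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def survival_boundary(mu_v, sigma_v, mu_n, sigma_n):
    """Closed-form boundary between the viral and non-viral Gaussians."""
    return (mu_n * sigma_v + mu_v * sigma_n) / (sigma_v + sigma_n)

# Hypothetical fitted parameters at one observation time t.
mu_v, sigma_v = 0.35, 0.08   # viral cascades: lower survival probability
mu_n, sigma_n = 0.80, 0.12   # non-viral cascades: higher survival probability
s_star = survival_boundary(mu_v, sigma_v, mu_n, sigma_n)

# Probability mass of each class that falls on its own side of the boundary.
viral_below = normal_cdf(s_star, mu_v, sigma_v)
nonviral_above = 1.0 - normal_cdf(s_star, mu_n, sigma_n)
```

With these numbers the boundary sits 2.25 standard deviations away from each mean, so both classes keep the same share of their mass on the correct side.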


3.2  Well-Definedness  of  Survival  Boundary 


In order to make the problem more complete and rigorous, in this subsection
we discuss the monotonicity of the survival boundary given in Definition 4,
i.e., we prove that the optimal survival boundary is itself a survival
function.

During the observation period, we rely on three facts. First of all, the
survival probabilities of both viral and non-viral cascades are naturally
monotonic decreasing with time $t$, so the average survival probabilities of
both kinds of cascades are also monotonic decreasing. Besides, non-viral
cascades intuitively possess a higher survival probability, thus the average
survival probability of non-viral cascades $\mu_n^t$ is reasonably larger than
that of viral ones $\mu_v^t$. Furthermore, real-world data shows that the
survival probability range of non-viral cascades appears to be more dynamic and
uncertain, which means the relative fluctuation of its standard deviation
$\sigma_n^t$ is also larger than that of $\sigma_v^t$. Formally, we specify
these three conclusions in Lemma 1.

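Lemma 1. For any given time $t$, $\mu_v^t$, $\sigma_v^t$ and $\mu_n^t$,
$\sigma_n^t$ respectively represent the average survival probability and its
standard deviation of viral and non-viral cascades. Given time $t' > t$, we have

$$\begin{cases} \mu_v^t \geq \mu_v^{t'} \\ \mu_n^t \geq \mu_n^{t'} \end{cases}, \qquad \mu_n^t \geq \mu_v^t, \qquad \frac{\sigma_n^{t'}}{\sigma_n^t} \geq \frac{\sigma_v^{t'}}{\sigma_v^t}, \qquad \forall\, 0 < t < t'. \tag{8}$$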


Based on Definition 4 and Lemma 1, we give a detailed proof that the optimal
survival boundary is itself a survival function.

Theorem 1. The optimal survival boundary $S^*(t)$ is monotonic decreasing with
time $t$, i.e., $S^*(t)$ is also a survival function. Formally, we have

$$S^*(t) \geq S^*(t'), \qquad \forall\, 0 < t < t'. \tag{9}$$
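Proof. For $\forall\, 0 < t < t'$, we have

$$\begin{aligned}
S^*(t) - S^*(t') &= \frac{\mu_n^t\sigma_v^t + \mu_v^t\sigma_n^t}{\sigma_v^t + \sigma_n^t} - \frac{\mu_n^{t'}\sigma_v^{t'} + \mu_v^{t'}\sigma_n^{t'}}{\sigma_v^{t'} + \sigma_n^{t'}} \\
&= \frac{(\mu_n^t - \mu_v^{t'})\sigma_v^t\sigma_n^{t'} + (\mu_v^t - \mu_n^{t'})\sigma_v^{t'}\sigma_n^t + (\mu_v^t - \mu_v^{t'})\sigma_n^t\sigma_n^{t'} + (\mu_n^t - \mu_n^{t'})\sigma_v^t\sigma_v^{t'}}{(\sigma_v^t + \sigma_n^t)(\sigma_v^{t'} + \sigma_n^{t'})} \\
&\geq \frac{(\mu_v^t - \mu_v^{t'})\sigma_n^t\sigma_n^{t'} + (\mu_v^t - \mu_n^{t'})\sigma_v^{t'}\sigma_n^t + (\mu_n^t - \mu_v^{t'})\sigma_v^{t'}\sigma_n^t + (\mu_n^t - \mu_n^{t'})\sigma_v^t\sigma_v^{t'}}{(\sigma_v^t + \sigma_n^t)(\sigma_v^{t'} + \sigma_n^{t'})} \\
&\geq 0,
\end{aligned} \tag{10}$$

according to Lemma 1. The first inequality uses $\mu_n^t \geq \mu_v^{t'}$
together with $\sigma_v^t\sigma_n^{t'} \geq \sigma_v^{t'}\sigma_n^t$; the second
follows because the two middle terms sum to
$\left[(\mu_v^t - \mu_v^{t'}) + (\mu_n^t - \mu_n^{t'})\right]\sigma_v^{t'}\sigma_n^t \geq 0$.
We can easily conclude that $S^*(t) \geq S^*(t')$.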


3.3  Hazard  Ceiling:  a  Dynamic  Perspective 

As defined in Definition 2, the hazard function is specifically denoted as
$h(t) = -\frac{\mathrm{d}S(t)}{\mathrm{d}t} \cdot \frac{1}{S(t)}$. We can easily
monitor the hazard function $h(t)$ of a cascade when given its survival
function $S(t)$.

To detect the early pattern of outbreak cascades, many previous works ignore
the underlying arrival process of retweets; instead, they only consider the
relationship between the static size of a cascade and a predefined threshold
[6], [?], and then determine whether the cascade is undergoing a burst period.
However, before the static size of a cascade accumulates to a certain
threshold, its burst pattern can be uncovered from dynamic information, such as
the hazard function $h(t)$ in this problem. Intuitively, if at a certain time
$t_0$ the hazard function $h(t)$ of a cascade suddenly rises above a hazard
ceiling $\alpha$, in other words, $h(t_0) > \alpha$, we deem that the burst
period of this cascade begins.


Fig.  4.  Hazard  Functions  and  Hazard  Ceiling 

However, instead of utilizing a fixed threshold, we employ the baseline hazard
function with a 5% hazard-tolerant interval as the hazard ceiling (illustrated
in Figure 4), since intuitively the characteristics of cascades may vary a lot
during the diffusion process. In Figure 4, the hazard ceiling is drawn as a red
dashed line with a grey hazard-tolerant interval, and the red solid line and
blue solid line respectively denote the hazard functions of a viral cascade and
a non-viral cascade. We can clearly see that the blue line never exceeds the
hazard ceiling $\alpha$, while the red line exceeds $\alpha$ and its
hazard-tolerant interval at $t_{hazard}$. Therefore, we deem that at
$t_{hazard}$, this cascade goes viral and starts to burst.
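
The ceiling test can be illustrated with the short Python sketch below. It assumes the 5% hazard-tolerant interval acts as a multiplicative margin on top of the baseline hazard, which is one plausible reading of the description above; the function name `first_ceiling_crossing` and the sample values are hypothetical.

```python
import numpy as np

def first_ceiling_crossing(times, hazard, baseline_hazard, tolerance=0.05):
    """Return the first time the hazard exceeds the tolerant ceiling, else None."""
    times = np.asarray(times, dtype=float)
    hazard = np.asarray(hazard, dtype=float)
    # Assumed ceiling: baseline hazard widened by the tolerance margin.
    ceiling = np.asarray(baseline_hazard, dtype=float) * (1.0 + tolerance)
    above = np.flatnonzero(hazard > ceiling)
    return float(times[above[0]]) if above.size else None

times = [0.0, 1.0, 2.0, 3.0, 4.0]
baseline = [0.10, 0.10, 0.10, 0.10, 0.10]
viral_hazard = [0.05, 0.08, 0.104, 0.30, 0.20]     # spikes at t = 3
nonviral_hazard = [0.05, 0.06, 0.07, 0.06, 0.05]   # stays under the ceiling

t_hazard = first_ceiling_crossing(times, viral_hazard, baseline)
t_none = first_ceiling_crossing(times, nonviral_hazard, baseline)
```

In this example the value 0.104 at t = 2 lies above the baseline but inside the tolerant interval, so the crossing is only reported at t = 3.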

3.4  Incorporation  of  Two  Techniques 

In this subsection, we summarize our method and integrate the survival boundary
and the hazard ceiling. The whole process of EPOC is shown in Alg. 1.

Algorithm 1: Algorithm of EPOC
Input: training data D, test data D', threshold p, hazard ceiling α.
Output: status vector V, detect time T.

Set labels for each cascade from D using threshold p ;
Train a Cox’s model C with time-dependent data D ;
Initialize survival function set as S ;
foreach d in D do
    estimate the survival function S_d(t) of d using C ;
    add S_d(t) to S ;
Train an optimal survival boundary S* with S ;
foreach d' in D' do
    estimate the survival function S_d'(t) and hazard function h_d'(t) of d' ;
    if S_d'(t) first falls below S*(t) at time t_0 then
        add 1 to V ;
        if h_d'(t) first rises above α at time t_1 then
            add min{t_0, t_1} to T ;
        else
            add t_0 to T ;
    else
        add 0 to V ;
        add none to T ;
return V and T.

In Alg. 1, Line1~Line3 are the initialization, and in particular we train the
Cox’s model with time-dependent features in Line2. Then the optimal survival
boundary is estimated in Line4~Line7; after that, we detect the burst pattern
between Line8 and Line18 using both survival probability and hazard rate.
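
The detection step of Alg. 1 (Line8 to Line18) can be sketched for a single test cascade as follows. This is a minimal illustration that assumes the survival curve, boundary, hazard, and ceiling have already been evaluated on a shared time grid; the function names are hypothetical.

```python
from typing import Iterable, Optional, Sequence, Tuple

def first_index(flags: Iterable[bool]) -> Optional[int]:
    """Index of the first True flag, or None if there is none."""
    for i, flag in enumerate(flags):
        if flag:
            return i
    return None

def detect_cascade(times: Sequence[float],
                   survival: Sequence[float],
                   boundary: Sequence[float],
                   hazard: Sequence[float],
                   ceiling: Sequence[float]) -> Tuple[int, Optional[float]]:
    """Return (status, detect_time) for one test cascade."""
    # t_0: first time the survival curve falls below the survival boundary.
    i0 = first_index(s < b for s, b in zip(survival, boundary))
    if i0 is None:
        return 0, None                      # non-viral: no detect time
    # t_1: first time the hazard rises above the hazard ceiling.
    i1 = first_index(h > c for h, c in zip(hazard, ceiling))
    if i1 is None:
        return 1, times[i0]
    return 1, min(times[i0], times[i1])

times = [0.0, 1.0, 2.0, 3.0]
boundary = [0.85, 0.70, 0.60, 0.50]
ceiling = [0.30, 0.30, 0.30, 0.30]
viral = detect_cascade(times, [0.9, 0.8, 0.4, 0.2], boundary,
                       [0.1, 0.5, 0.6, 0.7], ceiling)
calm = detect_cascade(times, [0.9, 0.85, 0.8, 0.75], boundary,
                      [0.1, 0.1, 0.1, 0.1], ceiling)
```

In the first example the hazard crosses its ceiling at t = 1, one step before the survival curve crosses the boundary at t = 2, so the earlier time is reported.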

4  Experiments 

In  this  section,  we  conduct  comprehensive  experiments  to  verify  our  model  in 
early  pattern  detection  of  outbreak  cascades.  Firstly,  we  describe  the  data  sets 
(Twitter  and  Weibo)  and  five  comparative  state-of-the-art  baselines  in  detail. 
Then we conduct our experiments and provide the corresponding analysis.
4.1 Data Sets

We implement our model EPOC on two large real-world data sets: Twitter and
Weibo. Twitter is one of the most famous social platforms in the world, with
0.5 billion users annually. We densely crawl the tweets that contain hashtags
with the Twitter search API. In our experiments, a cascade is considered to
consist of all tweets with the same hashtag. The other large data set, Weibo,
is from an online resource (see footnote 1). However, different from Twitter,
due to the sparsity of hashtags in Weibo, a cascade is defined by the diffusion
of a single microblog. More detailed information on the two data sets can be
found in Table 1.

Table 1. Data sets information

Data set   # of cascades   Type        Range                    Year   Size (GB)
Twitter    166,076         hashtag     Aug. 13th - Sep. 10th    2017   3.827
Weibo      300,000         microblog   Sept. 28th - Oct. 29th   2012   1.426

4.2  Experiment  Setting 

For our model implementation, we need to specify some settings. Because large
cascades are rare [?], in this paper we set the threshold between viral and
non-viral cascades to the 95th percentile of cascade size in both Twitter and
Weibo: a cascade with a larger size is regarded as viral, and otherwise as
non-viral. As cascades are formed by large resharing activities and can
potentially reach a large number of people [?], we only consider the cascades
with a tweet count larger than 50 in Twitter and filter out the rest. As for
Weibo, the cutoff is set to 80.

At the outset of our experiments, we randomly divide each data set into two
parts: 80% of the cascades are used as training data, and the remaining
one-fifth as test data. As for the hazard ceiling, in this paper we use the
baseline hazard function as the ceiling and set 5% as the hazard-tolerant
interval.
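
The filtering, labeling, and splitting steps described above can be sketched with NumPy as follows. The sketch assumes the 95th percentile is computed on the cascades that survive the size filter (the text does not say whether it is computed before or after filtering), and the function name `label_and_split` and the toy sizes are hypothetical.

```python
import numpy as np

def label_and_split(sizes, min_size, viral_percentile=95.0,
                    train_fraction=0.8, seed=0):
    """Filter small cascades, label viral ones, and split into train/test."""
    sizes = np.asarray(sizes)
    kept = np.flatnonzero(sizes > min_size)          # e.g. more than 50 tweets
    threshold = np.percentile(sizes[kept], viral_percentile)
    labels = (sizes[kept] > threshold).astype(int)   # 1 = viral, 0 = non-viral
    perm = np.random.default_rng(seed).permutation(kept.size)
    n_train = int(round(train_fraction * kept.size))
    train, test = perm[:n_train], perm[n_train:]
    return kept[train], kept[test], labels[train], labels[test], threshold

# Toy data: cascade sizes 1, 2, ..., 1050.
sizes = np.arange(1, 1051)
train_idx, test_idx, y_train, y_test, threshold = label_and_split(sizes, min_size=50)
```

With these toy sizes, 1000 cascades pass the filter, 800 go to training, 200 to testing, and the 50 largest are labeled viral.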

4.3  Baselines 

From the previous literature, we select a variety of approaches from different
perspectives to compare with EPOC: traditional machine learning methods,
Bayesian methods, survival methods, and time series methods.

—  Linear Regression (LR): Linear regression is a simple and feasible way to
characterize the relationship between variables and the final result. In this
paper, we divide the observation time into twelve time periods, then implement
LR with L1 regularization based on the different time periods, utilizing the
observed information to predict whether or when a cascade goes viral.

—  Support Vector Regression (SVR): SVR is a powerful regression model that is
widely used in various areas. We use SVR with a Gaussian kernel as a baseline
to predict whether a cascade will go viral or even burst in the near future.
The implementation details of SVR are similar to those of linear regression.

1 arnetminer.org/Influencelocality
—  PreWhether [12]: From a Bayesian perspective, PreWhether is one of the
pioneers in social content prediction, which utilizes three temporal features
(sum, velocity, and acceleration) to infer the ultimate popularity of content.
In our experiments, we also use the same time period manner to implement
PreWhether.

—  SEISMIC [10]: SEISMIC is a point-process-based time series model, which
takes individuals’ influence into consideration. Since the model itself is
designed to predict the popularity of single tweets in social networks, we
extend it to suit our goal of detecting cascades’ burst patterns.

—  SansNet [8]: SansNet is a network-agnostic approach proposed in recent
literature, which also regards the burst detection task as a judgement of
viral versus non-viral. This method achieves its detection performance using
only the time series information of a cascade.

4.4  Burst  Pattern  Detection 

Burst or Not: to detect the early pattern of outbreak cascades, we divide this
problem into two steps. Firstly, we detect whether a cascade will break out
based on the observed information. Since large cascades are arguably more
striking [?], in this classification task we employ two special metrics:
k-coverage and Cost. k-coverage mainly focuses on those cascades with a very
large size. Specifically, it is calculated as $n/k$ ($k \geq n$), where $k$ is
the number of the largest cascades being concentrated on, and $n$ denotes the
number of cascades we successfully detect from the top-$k$ viral cascades. In
this work, $k$ equals 50. Cost (more precisely called sensitive cost) is a
targeted metric, which is selected to handle the problem of unequal costs. If a
viral cascade (like a rumor [1]) is classified as non-viral, it will cost a lot
when this cascade gets larger and causes big trouble. On the contrary, if we
misclassify a non-viral cascade, it only costs some additional labor. Cost is
specified in Eqn. 11,

$$\mathrm{Cost} = \frac{FNR \times p \times Cost_{FN} + FPR \times (1 - p) \times Cost_{FP}}{p \times Cost_{FN} + (1 - p) \times Cost_{FP}}, \tag{11}$$

where FNR is the false negative rate, FPR is the false positive rate, $p$ is
the proportion of viral cascades in all cascades, and $Cost_{FN}$ and
$Cost_{FP}$ are entries in the cost matrix. We also specify the cost matrix in
Table 2.

Table 2. Unequal-Cost Matrix

Real Class   Detected Class
             Viral          Non-viral
Viral        Cost_TP = 0    Cost_FN = 5
Non-viral    Cost_FP = 1    Cost_TN = 0
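
Both metrics are simple to compute once predictions are available. The sketch below is a minimal illustration using the cost entries from the unequal-cost matrix (5 for a missed viral cascade, 1 for a false alarm); the function names and the toy inputs are hypothetical.

```python
def k_coverage(sizes, detected, k=50):
    """Fraction of the k largest cascades that were detected as viral."""
    top_k = sorted(range(len(sizes)), key=lambda i: sizes[i], reverse=True)[:k]
    n = sum(1 for i in top_k if detected[i])
    return n / k

def sensitive_cost(fnr, fpr, p, cost_fn=5.0, cost_fp=1.0):
    """Normalized unequal-cost error; 0 is perfect, 1 is all-wrong."""
    numerator = fnr * p * cost_fn + fpr * (1.0 - p) * cost_fp
    denominator = p * cost_fn + (1.0 - p) * cost_fp
    return numerator / denominator

# Toy example: six cascades, evaluated on the three largest.
sizes = [10, 500, 30, 400, 20, 300]
detected = [0, 1, 0, 0, 1, 1]
coverage = k_coverage(sizes, detected, k=3)

# Toy error rates with 5% viral cascades.
cost = sensitive_cost(fnr=0.2, fpr=0.1, p=0.05)
```

Here the three largest cascades have sizes 500, 400, and 300, of which two are detected, so the coverage is 2/3; the cost works out to 0.145 / 1.2.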

Performance Analysis. The results of burst detection are aggregated in Table 3,
and the best result in each row is obtained by EPOC. One can see that, in
general, EPOC performs better than the five baselines in terms of both
k-coverage and Cost. LR also shows great performance in k-coverage on Weibo,
and it works much better than SVR and SEISMIC, which suggests that the L1
regularization takes effect. As a probabilistic model, PreWhether gives a
slightly poorer detection result due to the assumption that all the features
are independent. SansNet outperforms the other baselines on almost every
measure in this classification task (LR attains a marginally lower Cost on
Weibo), though it is less effective than EPOC since it only employs one feature
from cascades. However, it is worth noting that SansNet gives stable k-coverage
and Cost results on both Twitter and Weibo, which indicates that survival
perspective models are suitable in this scenario.


Table 3. Result of burst detection on Twitter and Weibo

Data set   Metric       LR       SVR      PreWhether   SEISMIC   SansNet   EPOC
Twitter    k-coverage   0.7781   0.5969   0.7490       0.5188    0.8275    0.8471
Twitter    Cost         0.1032   0.0998   0.0956       0.1677    0.0776    0.0701
Weibo      k-coverage   0.6805   0.4918   0.6512       0.4589    0.7720    0.7784
Weibo      Cost         0.0951   0.1229   0.1271       0.1581    0.0961    0.0881

Change of Observation Periods. To explore the connection between the observing
period and the performance of the methods, we conduct experiments on Twitter
with six time periods from 0.5 to 3 hours and organize the results in Figure 5.
As expected, the performances of EPOC and the five baselines improve gradually
as the observing period increases. We can clearly see that EPOC performs the
best, with a high k-coverage of about 87% and a low Cost of around 0.068.
Besides, it is worth noticing that SEISMIC is far behind the other approaches
in both k-coverage and Cost, which suggests that the time series model depends
on a relatively longer observing period and cannot do a good job in the burst
detection task at the early stage.


Fig. 5. k-coverage and Cost under Different Observing Periods on Twitter
(legend: LR, SVR, PreWhether, SEISMIC, SansNet, EPOC)

Time Ahead (similar to EPA from [8]). Further, we try to figure out how early
we can detect the outbreak cascades with EPOC. As [?] states, it is a
pathological task to estimate the final size of a cascade given only a short
initial portion, since almost all cascades are small. Besides, compared with
obtaining the final size of a cascade, it is more meaningful and practical to
detect how early a cascade will break out. Therefore, in this experiment on
Twitter and Weibo, we only probe into the early pattern of outbreak cascades,
and mainly focus on the absolute time ahead, which is the interval between the
predicted burst time $t_{predict}$ and the actual burst time $t_{actual}$.
Specifically, during the experiments, if $t_{actual} > t_{predict}$, we record
$t_{actual} - t_{predict}$, and otherwise 0. We also consider the relative time
ahead, which is given by $\frac{t_{actual} - t_{predict}}{t_{actual}}$ or 0.
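
The two time-ahead measures can be illustrated with a minimal sketch. It
assumes that both times are measured from the start of the cascade in the same
unit, and that the relative value is the absolute value divided by the actual
burst time (this normalisation is an assumption here). The function name
time_ahead and the sample values are hypothetical and serve only as an
illustration.

```python
from typing import Tuple


def time_ahead(t_predict: float, t_actual: float) -> Tuple[float, float]:
    """Return (absolute, relative) time ahead for a single cascade.

    Both arguments are times since the start of the cascade, in the same unit.
    """
    if t_actual <= 0:
        raise ValueError("t_actual must be positive")
    # A prediction that arrives after the actual burst is recorded as 0.
    absolute = max(t_actual - t_predict, 0.0)
    # Assumption: the relative measure is normalised by the actual burst time.
    relative = absolute / t_actual
    return absolute, relative


if __name__ == "__main__":
    # Hypothetical cascade: burst predicted at hour 6, actual burst at hour 10.
    print(time_ahead(6.0, 10.0))
    # Hypothetical late prediction: both measures fall back to 0.
    print(time_ahead(12.0, 10.0))
```

Under this reading, a relative time ahead of 0.4 means the burst was flagged
when only 60% of the time to the actual burst had elapsed.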


Fig.  6.  Absolute  and  Relative  Time  Ahead  on  Twitter  and  Weibo 

Performance Analysis. Figure 6 illustrates the corresponding experiment results
on Twitter and Weibo. We conclude that all the methods have a similar rank in
terms of absolute time ahead and relative time ahead. SansNet and EPOC
consistently lead in this regression task, predicting the burst about 38.75%
and 40.12% ahead of the actual burst time on Twitter, respectively. PreWhether
and LR perform moderately: they can successfully predict the occurrence of the
burst when the diffusion process of a cascade has run only about two thirds of
its course. Though SVR performs much better than SEISMIC, the poorest method,
it falls behind the other baselines, which suggests that the notion of support
vectors may not be applicable to this problem.

5  Related  Work 

Social networks have attracted considerable attention from researchers, and
plenty of achievements have been made in the past few decades, especially in
the study of information cascades, including the prediction of cascade size and
how a cascade grows and disseminates.

5.1  Information  Cascade  and  Social  Networks 

The study of information cascades has a long history, and it is of great use in
many applications, such as meme tracking [2], stock bubble diagnosis [3], and
sales prediction [4]. The literature concerning cascades in social networks can
be divided into three categories. The first category focuses on user-level
prediction. One of the pioneers is Iwata et al. [?], who propose a Bayesian
inference model with a stochastic EM algorithm to discover the latent influence
among online users. [19] also utilizes user-related features to help social
event detection. A second category analyzes the network topology, since
structural features are said to be among the predictors of cascade size [?].
PageRank of the retweeting graph is taken into consideration in [20], while
[21] utilizes the number of directed followers as one of the important factors.
Another significant category is temporal features. Many experimental results,
such as [?], reveal that temporal features are the most effective type of
indicators. To depict the connection between an early cascade and its final
state, both [5] and [12] propose Bayesian networks with temporal information.
Other temporal information, like mean time and maximum time interval, has also
been considered [9].

5.2  Outbreak  Detection  and  Modeling 

Burst or outbreak, defined as “a brief period of intensive activity followed by
long period of nothingness” [6], is a common phenomenon during the diffusion of
social content, which is worth studying and may bring benefits to modern
society. Existing works probing into cascades mainly focus on predicting their
future popularity [?] or final aggregate size [?]. However, how to detect the
burst pattern of a large cascade at an early stage remains an intriguing
problem. Recently, based on the transformation of time windows, Wang et al. [6]
propose a classification model to predict the burst time of a cascade.
Unfortunately, their approach requires laborious feature extraction, and the
traditional classifiers they use can hardly make the best use of the features.
[?] implements a logistic model, which considers all the nodes as cascade
sensors. This approach has a similar weakness: when the number of nodes in a
network reaches billions, implementing the method becomes particularly
difficult.

In this work, by adopting survival theory, we overcome these drawbacks from the
perspective of cascade dynamics. Other researchers also employ survival models
to understand the burst of cascades. SansNet is proposed in [8] to predict
whether and when a cascade goes viral. This approach utilizes only the size of
cascades as a feature, which makes it hard to apply to multiple cases, since
the features of an author [22] and the inherent network [?] are sometimes more
important than features from the cascade itself [22]. Another drawback of this
approach is that the survival curve cannot totally reveal the status of
cascades.

6  Conclusion  and  Perspectives 

In social networks, detecting whether and when a cascade will break out is a
non-trivial but beneficial task. In this paper, we employ survival theory and
propose a survival model, EPOC, to detect the early pattern of outbreak
cascades. We extract both dynamic and static features from cascades and utilize
Gaussian distributions to characterize their survival probabilities. Combined
with the hazard rate, this allows us to detect the burst pattern of cascades at
a very early stage. Extensive experiments show that EPOC outperforms five
state-of-the-art methods in this practical task.

As future work, we will first concentrate on how to choose a better standard
baseline for the hazard ceiling, which may call for more experimental
observation. Then, we will consider more influential and relevant features, or
try another suitable model based on survival theory. Finally, we hope that our
work will pave the way to a richer and deeper understanding of cascades.